Human-computer interaction benefits greatly from emotion recognition from
speech. To promote a contact-free environment during the coronavirus disease
2019 (COVID-19) pandemic, most digital systems adopted speech-based
interfaces. Consequently, emotion detection from speech has many beneficial
applications in pathology. The vast majority of speech emotion recognition
(SER) systems are built on machine learning or deep learning models and
therefore demand considerable computing power. This issue was addressed by
developing traditional algorithms for feature selection. Recent research has
shown that nature-inspired or evolutionary meta-heuristic approaches, such as
equilibrium optimization (EO) and cuckoo search (CS), are superior to
traditional feature selection (FS) models in terms of recognition performance.
The purpose of this study is to investigate the impact of meta-heuristic
feature selection approaches on emotion recognition from speech. To this end,
we selected the Ryerson audio-visual database of emotional speech and song
(RAVDESS) and obtained maximum recognition accuracies of 89.64% using the EO
algorithm and 92.71% using the CS algorithm. As a final step, we plotted the
associated precision and F1 score for each of the emotional classes.


This is an open access article under the CC BY-SA license. 


Corresponding Author: 


Kesava Rao Bagadi 

Department of Electronics and Communication Engineering, Mahatma Gandhi Institute of Technology 
CB Post, Gandipet, Hyderabad, Telangana, India-50075 

Email: bkesavarao_ece @ mgit.ac.in 


1. INTRODUCTION 

A variety of information sources can be used to detect emotions in people, such as speech,
transcripts, facial expressions, brain signals (EEG), and combinations of two or more of these
(multi-modal emotion recognition). Among these, emotion recognition from speech is an essential
element of human-computer interaction. Speech emotion recognition uses acoustic analysis to
identify vocal changes caused by emotions and then determines which features indicate the
presence of an emotion [1]. However, many emotional databases contain irrelevant or redundant
information, which can lower classification accuracy. This issue can be addressed by applying
effective feature selection (FS) methods to speech-based applications. FS shortens the response
time of the algorithm and can in turn yield higher classification accuracy. The FS process has
three main phases: generation of feature subsets from the whole feature set, evaluation, and
validation [2].
As noted, traditional FS models demand considerable computation and time on speech emotional
databases. Many traditional feature selection algorithms have been developed to select relevant
features for emotion classification from a speech signal. They fall into filter and wrapper
approaches. Filter approaches rely on criteria such as information gain [3], mutual information [4]
and principal component analysis [5]. In the wrapper approach, by contrast, a classifier such as
the K-nearest neighbour (KNN) [6] or support vector machine (SVM) [7] is used to assess the quality
of the resulting subsets. During the generation phase, searching over all possible subsets of the
extracted features requires substantial computational effort and time. Hence, traditional FS
methods are not well suited to speech emotion recognition (SER) tasks. Researchers have therefore
turned to nature-inspired optimization algorithms, known as meta-heuristic approaches. These
meta-heuristic algorithms are intelligent search algorithms that have already been applied to many
artificial intelligence problems [8]. Recently, some researchers have adopted nature-inspired
meta-heuristic algorithms to improve recognition accuracy with fewer computational requirements.
Well-known meta-heuristic algorithms such as the genetic algorithm (GA), ant colony optimization,
cuckoo search (CS), particle swarm optimization (PSO) and grey wolf optimization (GWO) have been
employed to obtain an optimal feature subset for speech-based emotion tasks [9]. In this paper, we
address a key concern, i.e., the impact of feature selection models using meta-heuristic approaches
on speech emotion recognition (SER) systems. An accurate classification model requires appropriate
feature generation, feature selection, and classification methods [10]. Against this background,
Figure 1 shows the role of feature selection methods in speech-based emotion recognition
applications.

The key contributions of this paper are as follows. First, we study the latest state-of-the-art
meta-heuristic feature selection models for speech emotion recognition. Second, out of the many
heuristic approaches, we analyse the impact of the equilibrium optimization (EO) and CS algorithms
on SER tasks. Finally, we analyse various performance metrics on the Ryerson audio-visual database
of emotional speech and song (RAVDESS) dataset. The rest of the paper is organized as follows:
section 2 reviews related work on SER using meta-heuristic approaches; section 3 describes the
materials, such as the speech emotional database used in this study; section 4 presents the
methodology for recognizing emotions from speech based on meta-heuristics; section 5 discusses the
experimental results and analysis; and section 6 gives the conclusion and future perspectives for
this study.
Figure 1. General framework of SER system


2. RELATED WORK 

Many academic groups and research centres work on automatic speech emotion recognition and
concentrate on FS algorithms to reduce computational requirements. Initially, a modified
multi-objective genetic feature selection algorithm was proposed for speech emotion recognition by
Brester et al. [11], achieving F1-scores of 86.37% and 67.70% on the Berlin emotional speech
database (EmoDB) and the Surrey audio-visual expressed emotion (SAVEE) database, respectively.
Unlike content-based speech recognition systems, context-independent models use only signal
parameters, which classifiers treat as training and testing vectors [12]. The stability of a
feature selection algorithm refers to the consistency of the selected features when training
samples are added or removed [13]. Stability also influences how reliably important features are
identified in knowledge discovery [14]. In [15], a wrapper-based PSO algorithm was proposed as an
FS model for SER tasks and achieved a recognition rate of up to 78.44% on the SAVEE database.
Kozodoi et al. [16] presented a framework for credit scoring using genetic algorithms. Cuckoo
search, proposed in [17], has also given impressive results for SER tasks. Dey et al. [18] reported
that a hybrid meta-heuristic optimization FS model achieved accuracies of 97.31% and 98.45% on
SAVEE and EmoDB, respectively. Very recently, Daneshfar et al. [19] proposed a quantum-behaved
particle swarm optimization (QPSO) algorithm for the dimensionality reduction of speech features in
emotion recognition on various datasets; compared with state-of-the-art algorithms, this method
produced more accurate results. Zhang [20] addressed SER using a weighted binary cuckoo search
algorithm and achieved an F1-score of 83.80%. In all these works, researchers explored various
meta-heuristic optimization algorithms for SER tasks. Studies show that no feature selection method
can be said to improve or degrade SER performance in every case; its effect depends on the
classifier, the data, and the degree of dimensionality reduction. Building on this literature, we
examine the impact of FS methods on speech-based emotion recognition. Specifically, this work
applies two optimization algorithms, EO and CS, to a publicly available speech emotional database,
RAVDESS.


3. MATERIALS 
3.1. Speech emotional database 

The choice of database is a crucial part of speech emotion recognition, since performance depends
on the naturalness of the database. In this paper, we chose RAVDESS [21], a publicly available
English-language speech emotional database. It contains clips of male and female speakers
expressing emotions such as anger, sadness, fear, excitement, happiness and neutral. Each sample in
the dataset is assigned a unique identifying name, and every sample is labelled as either normal or
strong in intensity. This study extracts features that carry emotional information, selects the
relevant ones for further processing, and then classifies them using an appropriate classifier.


3.2. Feature extraction 

System performance and accuracy depend on the features extracted from the signal. The salient
features of speech signals must be extracted to identify different emotional states and speaking
styles. Generally, speech features are classified as acoustic features and spectral features. To
analyze the speech signal, characteristics such as pitch, energy, zero-crossing rate, the average
Mel frequency cepstral coefficients (MFCCs) and a discrete wavelet transform are extracted. In both
traditional and other feature selection methods for SER tasks, MFCCs are among the most prominent
features for recognizing emotion from speech accurately. They provide a way to characterize the
properties of the voice signal. MFCCs have been found to be superior for speech recognition because
they model the frequency sensitivity of human perception. Here, the 12 primary discrete cosine
transform (DCT) coefficients were used as the feature vector for recognising emotions. The process
of extracting MFCCs is shown in Figure 2. In this work, we extracted MFCC features using the
openSMILE toolkit [22].
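The final stage of MFCC extraction can be illustrated with a short sketch that applies a type-II DCT to a vector of log Mel filter-bank energies and keeps 12 coefficients. This is a minimal illustration and not the openSMILE implementation: the function name `dct_coefficients`, the 26-band toy input, the choice to start at coefficient 0, and the omission of scaling and liftering are all assumptions made here.

```python
import math


def dct_coefficients(log_energies, num_coeffs=12):
    """Type-II DCT of log Mel filter-bank energies, keeping num_coeffs terms.

    Hypothetical helper: unscaled DCT-II, no liftering, starts at k = 0.
    """
    n = len(log_energies)
    coeffs = []
    for k in range(num_coeffs):
        total = 0.0
        for m, value in enumerate(log_energies):
            total += value * math.cos(math.pi * k * (m + 0.5) / n)
        coeffs.append(total)
    return coeffs


# Toy input: 26 made-up log filter-bank energies for a single frame.
frame = [math.log(1.0 + band) for band in range(26)]
mfcc = dct_coefficients(frame)
```

A flat spectrum carries no spectral shape, so every coefficient above the zeroth vanishes for a constant input; this is a quick sanity check on the transform.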
Figure 2. Process of MFCCs feature extraction from speech


3.3. Feature selection 

Machine learning algorithms tend to over-fit when the dimension of the feature set is large,
resulting in low performance. In machine learning, the objective of FS is to reduce the
dimensionality of the features and the cost of classification. Unlike traditional feature selection
methods, here we select the optimal feature subset for emotion recognition using meta-heuristic
optimization algorithms, i.e., EO and CS.
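A feature subset can be encoded as a binary mask over the feature columns, and a meta-heuristic then searches for the mask with the lowest fitness. The sketch below shows one common way to score a mask. The paper does not state its fitness function, so the weighted sum of classifier error and subset size, the weight `alpha = 0.99`, and the helper names `apply_mask` and `subset_fitness` are assumptions for illustration only.

```python
def apply_mask(features, mask):
    """Keep only the feature columns whose mask bit is 1."""
    return [[x for x, keep in zip(row, mask) if keep] for row in features]


def subset_fitness(mask, error_rate, alpha=0.99):
    """Lower is better: weighted classifier error plus fraction of features kept.

    error_rate is assumed to come from a classifier trained on the masked
    features (for example the SVM used in this work); alpha is hypothetical.
    """
    selected = sum(mask)
    if selected == 0:
        return float("inf")  # an empty subset cannot be classified
    return alpha * error_rate + (1.0 - alpha) * selected / len(mask)


# Two toy samples with three features each; keep the first and third feature.
reduced = apply_mask([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [1, 0, 1])
```

With equal error rates, the smaller subset receives the lower fitness, which is what pushes the search towards compact feature sets.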


3.4. Classifier 

Classification involves training a machine-learning algorithm on a dataset and then identifying or
classifying new observations, i.e., a test data set. In this work, we classify emotions using an
SVM classifier. SVM is one of the simplest and most popular classifiers. It creates a hyperplane
between different classes of data that serves as an optimal decision boundary [23]. A strength of
SVM is that it does not suffer from multiple local minima. Hence, in this work, we selected SVM as
the classifier to recognize emotions from speech.


4. METHODOLOGY 

Here, we examine the impact of meta-heuristic optimization algorithms, namely EO and CS, on SER
tasks. The framework of this FS model is shown in Figure 3. First, MFCC features are extracted from
the RAVDESS dataset using the openSMILE tool. Then, following the principle of nature-inspired
algorithms, the initial populations of the EO and CS algorithms are generated. The EO algorithm is
inspired by control volume mass balance models, which are used to estimate both dynamic and
equilibrium states.

Because it considers exploration and exploitation simultaneously, EO has the advantage of
maintaining a good balance between them [24]. The algorithm models the dynamic mass balance of a
control volume system. Cuckoo search is another successful optimization method. Yang and Deb
developed CS, one of the latest nature-inspired meta-heuristic algorithms, in 2010 [25]; it employs
Lévy flights rather than simple isotropic random walks. According to recent studies, CS is
potentially far more efficient than PSO. From a mathematical perspective, the algorithm solves
n-dimensional linear and non-linear optimization problems with simple mathematics, and binary
variants have been developed for binary optimization problems.

The general mass balance equation states that the change in mass over time equals the mass entering
the system, minus the mass leaving it, plus the mass generated inside it. It is written as:

$$V \frac{dC}{dt} = Q C_{eq} - Q C + G \qquad (1)$$

Here $C$ is the concentration inside the control volume $V$, $V \frac{dC}{dt}$ is the rate of
change of mass in the control volume, $Q$ is the volumetric flow rate into and out of the control
volume, $C_{eq}$ is the concentration at the equilibrium state, in which there is no generation
inside the control volume, and $G$ is the mass generation rate inside the control volume. The
initial population of EO is determined by the number of particles and the number of dimensions. A
randomly generated initial population is represented by (2).

$$C_i^{initial} = C_{min} + rand_i \, (C_{max} - C_{min}), \quad i = 1, 2, \ldots, n \qquad (2)$$

where $C_i^{initial}$ is the initial concentration vector of the $i$-th particle, $C_{min}$ and
$C_{max}$ are the minimum and maximum particle concentrations, $rand_i$ is a random vector in
[0, 1], and $n$ is the population size. The equilibrium state is the final convergence state of the
algorithm and is expected to be the global optimum.
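The initialization in (2) can be sketched in a few lines. The function names `init_population` and `to_mask`, the default bounds of 0 and 1, and the 0.5 threshold used to turn a continuous particle into a binary feature mask are assumptions for illustration; the paper does not specify how particles are binarized.

```python
import random


def init_population(n_particles, n_dims, c_min=0.0, c_max=1.0, seed=None):
    """Draw each concentration uniformly between c_min and c_max, as in (2)."""
    rng = random.Random(seed)
    return [[c_min + rng.random() * (c_max - c_min) for _ in range(n_dims)]
            for _ in range(n_particles)]


def to_mask(particle, threshold=0.5):
    """Assumed binarization: select feature j when its concentration exceeds 0.5."""
    return [1 if c > threshold else 0 for c in particle]


population = init_population(n_particles=5, n_dims=12, seed=1)
masks = [to_mask(p) for p in population]
```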

At the beginning of the optimization process the equilibrium state, i.e., the global optimum, is
unknown, so only equilibrium candidates can be determined. These candidates are the four best
particles found so far during the whole optimization process. An additional particle, whose
concentration equals the average of these four particles, is added to them; the choice of four
candidates is based on numerous experiments on various types of problems. For other optimization
problems, the number of candidates selected is arbitrary. A vector named the equilibrium pool is
constructed from these five candidates, as listed in (3).


$$C_{eq,pool} = \{ C_{eq(1)}, C_{eq(2)}, C_{eq(3)}, C_{eq(4)}, C_{eq(avg)} \} \qquad (3)$$
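The pool in (3) can be sketched as follows. For brevity this example ranks a single population snapshot, whereas the algorithm keeps the four best particles found over all iterations; the function name `build_equilibrium_pool` and the convention that lower fitness is better are assumptions.

```python
def build_equilibrium_pool(particles, fitness_values):
    """Return the four best particles (lowest fitness) plus their average."""
    ranked = sorted(zip(fitness_values, particles), key=lambda pair: pair[0])
    best_four = [list(p) for _, p in ranked[:4]]
    # Element-wise mean of the four candidates gives the fifth pool member.
    average = [sum(column) / len(best_four) for column in zip(*best_four)]
    return best_four + [average]


pool = build_equilibrium_pool(
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]],
    [5.0, 1.0, 2.0, 3.0, 4.0],
)
```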
The exponential term ($F$) is the main contributor to the concentration updating rule and is given
in (4).

$$F = e^{-\lambda (t - t_0)} \qquad (4)$$


In (5), time $t$ is defined as a function that decreases as the number of iterations ($Iter$)
increases.

$$t = \left(1 - \frac{Iter}{Max_{iter}}\right)^{\left(a_2 \frac{Iter}{Max_{iter}}\right)} \qquad (5)$$


Here, $a_2$ (written $k_2$ in Algorithm 1) is a constant that controls the exploitation ability. To
guarantee convergence by slowing down the search speed while improving the exploration and
exploitation abilities, $t_0$ is defined as in (6).

$$t_0 = \frac{1}{\lambda} \ln\left(-k_1 \, sign(h - 0.5)\left[1 - e^{-\lambda t}\right]\right) + t \qquad (6)$$

where $k_1$ controls the exploration ability and $sign(h - 0.5)$ determines the direction of
exploration and exploitation. Here $h$ is a random number in [0, 1]. Substituting (6) into (4)
gives the modified form:


TELKOMNIKA Telecommun Comput El Control, Vol. 21, No. 1, February 2023: 159-167 


$$F = k_1 \, sign(h - 0.5)\left[e^{-\lambda t} - 1\right] \qquad (7)$$
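The time schedule in (5) and the exponential term in (7) are simple enough to check numerically. The sketch below is an illustration only: the function names are hypothetical, and the default constants follow the values listed in Algorithm 1 (2 for the exploration constant, 1 for the exploitation constant).

```python
import math


def time_term(iteration, max_iter, a2=1.0):
    """Equation (5): t starts at 1 and decays to 0 over the iterations."""
    ratio = iteration / max_iter
    return (1.0 - ratio) ** (a2 * ratio)


def exponential_term(lam, h, t, k1=2.0):
    """Equation (7): F = k1 * sign(h - 0.5) * (exp(-lam * t) - 1)."""
    sign = (h > 0.5) - (h < 0.5)  # evaluates to -1, 0 or 1
    return k1 * sign * (math.exp(-lam * t) - 1.0)


t_start = time_term(0, 100)
t_end = time_term(100, 100)
f_value = exponential_term(lam=1.0, h=0.9, t=1.0)
```

Because $t$ shrinks towards 0, the magnitude of $F$ also shrinks, so late iterations move particles less and the search shifts from exploration to exploitation.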
In addition, the generation rate is a crucial term that strengthens the exploitation phase and
thereby helps to provide an exact solution to the optimization problem. The well-known 1-D model,
one of many models for calculating the generation rate, is given in (8).

$$G = G_0 e^{-k (t - t_0)} \qquad (8)$$

where $G_0$ and $k$ represent the initial value and the decay constant, respectively. To produce a
more symmetrical and controlled search pattern, the decay constant is taken as $k = \lambda$, so
that $G = G_0 F$, and the initial value is written as in (9).


$$G_0 = GCP \, (C_{eq} - \lambda C) \qquad (9)$$

Here, the generation rate control parameter (GCP) determines the probability that the generation
term contributes to the update. Finally, (10) gives the EO updating rule:

$$C = C_{eq} + (C - C_{eq}) F + \frac{G}{\lambda V} (1 - F) \qquad (10)$$
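A single application of (9) and (10) can be sketched element-wise as below. The function names are hypothetical, the generation rate is taken as the initial value from (9) multiplied by the exponential term, and the control volume is set to 1, which is an assumption made here for simplicity.

```python
def generation_rate(c, c_eq, f, lam, gcp):
    """G = G0 * F with G0 = GCP * (C_eq - lambda * C), as in (9)."""
    return [gcp * (ceq_j - lam_j * c_j) * f_j
            for c_j, ceq_j, f_j, lam_j in zip(c, c_eq, f, lam)]


def update_concentration(c, c_eq, f, g, lam, volume=1.0):
    """Equation (10): C = C_eq + (C - C_eq) F + G / (lambda V) * (1 - F)."""
    return [ceq_j + (c_j - ceq_j) * f_j + (g_j / (lam_j * volume)) * (1.0 - f_j)
            for c_j, ceq_j, f_j, g_j, lam_j in zip(c, c_eq, f, g, lam)]


g = generation_rate(c=[1.0], c_eq=[2.0], f=[0.5], lam=[1.0], gcp=0.5)
new_c = update_concentration(c=[1.0], c_eq=[2.0], f=[0.5], g=g, lam=[1.0])
```

Two limiting cases help to read the rule: with $F = 1$ the particle stays where it is, and with $F = 0$ and no generation it jumps to the equilibrium candidate.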
Figure 3. Experimental framework for speech emotion recognition


To perform the optimization, cuckoo search follows three basic rules: a) each cuckoo lays one egg
at a time and dumps it in a randomly chosen nest; b) the best nests with high-quality eggs are
carried over to the next generations; and c) the number of available host nests is fixed, and the
egg laid by a cuckoo is discovered by the host bird with a probability $p_a \in (0, 1)$. In that
case, the host bird can either remove the egg from the nest or abandon the nest and build a new
one. The nests are updated by random Lévy flights in the first stage of the algorithm. Pseudocode
for the two feature selection algorithms is given below. Algorithm 1 gives the EO-based procedure
for finding the optimal feature subset of the main feature set for the speech emotion recognition
model. CS, another popular nature-inspired algorithm, is likewise used to find the optimal feature
set and improve the recognition performance; its pseudocode is given in Algorithm 2.
Algorithm 1. Pseudo code for EO based FS model for SER [24]


Input: initial population and feature space
Output: final combination of features (best solution)

Initialize the particle population p_i, i = 1, 2, 3, ..., n
Assign a large fitness value to each of the four equilibrium candidates
Assign the free parameters k1 = 2, k2 = 1, GP = 0.5
While iter < max_iter do
    For i = 1, ..., n (n is the number of particles) do
        Calculate the fitness of the i-th particle
        If fit(p_i) < fit(p_eq(1)) then
            Replace p_eq(1) with p_i and fit(p_eq(1)) with fit(p_i)
        Else if fit(p_i) > fit(p_eq(1)) and fit(p_i) < fit(p_eq(2)) then
            Replace p_eq(2) with p_i and fit(p_eq(2)) with fit(p_i)
        Else if fit(p_i) > fit(p_eq(1)) and fit(p_i) > fit(p_eq(2)) and
                fit(p_i) < fit(p_eq(3)) then
            Replace p_eq(3) with p_i and fit(p_eq(3)) with fit(p_i)
        Else if fit(p_i) > fit(p_eq(1)) and fit(p_i) > fit(p_eq(2)) and
                fit(p_i) > fit(p_eq(3)) and fit(p_i) < fit(p_eq(4)) then
            Replace p_eq(4) with p_i and fit(p_eq(4)) with fit(p_i)
        End if
    End for
    p_eq(avg) = (p_eq(1) + p_eq(2) + p_eq(3) + p_eq(4)) / 4
    Construct the equilibrium pool P_eq,pool = {p_eq(1), p_eq(2), p_eq(3), p_eq(4), p_eq(avg)}
    Perform memory saving (if iter > 1)
    Assign t = (1 - iter/max_iter)^(k2 * iter/max_iter)
    For i = 1, ..., n (n is the number of particles) do
        Choose a random candidate p_eq from the equilibrium pool
        Generate random vectors λ and h, and compute F = k1 * sign(h - 0.5) * [exp(-λ t) - 1]
        Construct GCP = 0.5 * r1 if r2 >= GP, else 0 (r1 and r2 are random numbers in [0, 1])
        Construct G0 = GCP * (p_eq - λ p_i) and G = G0 * F
        Update the concentration p_i = p_eq + (p_i - p_eq) * F + (G / (λ V)) * (1 - F)
    End for
    iter = iter + 1
End while
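Algorithm 1 can be turned into a compact, runnable sketch. This is an illustration under stated assumptions and not the authors' implementation: particles are binarized with a 0.5 threshold, the control volume is 1, concentrations are clamped to [0, 1], the memory-saving step is omitted, a better particle shifts the lower-ranked candidates down instead of overwriting a single slot, and `toy_fitness` is a made-up objective standing in for the classifier-based fitness.

```python
import math
import random


def eo_feature_selection(fitness, n_features, n_particles=10, max_iter=30,
                         k1=2.0, k2=1.0, gp=0.5, seed=0):
    """Minimize fitness(mask) over binary masks with a simplified EO loop."""
    rng = random.Random(seed)

    def to_mask(p):
        return [1 if c > 0.5 else 0 for c in p]  # assumed binarization

    particles = [[rng.random() for _ in range(n_features)]
                 for _ in range(n_particles)]
    # Four equilibrium candidates, initialized with a very large (bad) fitness.
    pool = [(float("inf"), [0.0] * n_features) for _ in range(4)]

    for it in range(max_iter):
        for p in particles:
            fit = fitness(to_mask(p))
            for rank in range(4):  # keep the four best-so-far, sorted
                if fit < pool[rank][0]:
                    pool.insert(rank, (fit, list(p)))
                    pool.pop()
                    break
        best_four = [c for _, c in pool]
        average = [sum(col) / 4.0 for col in zip(*best_four)]
        candidates = best_four + [average]  # the equilibrium pool
        t = (1.0 - it / max_iter) ** (k2 * it / max_iter)
        for p in particles:
            c_eq = rng.choice(candidates)
            gcp = 0.5 * rng.random() if rng.random() >= gp else 0.0
            for j in range(n_features):
                lam = rng.random() or 1e-12  # avoid division by zero
                sign = 1.0 if rng.random() >= 0.5 else -1.0
                f = k1 * sign * (math.exp(-lam * t) - 1.0)
                g = gcp * (c_eq[j] - lam * p[j]) * f
                new = c_eq[j] + (p[j] - c_eq[j]) * f + (g / lam) * (1.0 - f)
                p[j] = min(1.0, max(0.0, new))  # clamp to [0, 1]
    best_fit, best = pool[0]
    return to_mask(best), best_fit


INFORMATIVE = {0, 3, 5}  # hypothetical set of useful feature indices


def toy_fitness(mask):
    """Made-up objective: penalize missed informative and extra features."""
    missed = sum(1 for j in INFORMATIVE if not mask[j])
    extra = sum(mask) - (len(INFORMATIVE) - missed)
    return missed + 0.1 * extra


mask, score = eo_feature_selection(toy_fitness, n_features=8)
```

In an SER setting, `toy_fitness` would be replaced by a function that trains the classifier on the masked MFCC features and returns its validation error.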

Algorithm 2. Pseudo code for CS based FS model for SER [25] 
Input: maximum number of iterations, maximum population size, and maximum number of features
Output: final combination of features (best solution)

Begin
    Objective function f(x), where x = (x1, x2, x3, ..., xd)^T
    Generate an initial population of n host nests x(i), i = 1, 2, ..., n
    While (t < Max_Generation) or (stop criterion) do
        Get a cuckoo randomly by Lévy flights and evaluate its quality or fitness F_i
        Choose one nest among the n nests (say, j) at random
        If (F_i < F_j) then
            Replace j by the new solution
        End if
        Abandon a fraction (p_a) of the worst nests and build new ones
        Keep the best solutions (or the nests with quality solutions)
        Rank the solutions and find the current best
    End while
    Post-process and visualize the results
End
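Algorithm 2 can likewise be sketched in runnable form. The example below is a simplified illustration and not the authors' implementation: Lévy steps are drawn with Mantegna's algorithm, nests are binarized with a 0.5 threshold, and the step size, the abandonment fraction of 0.25 and the objective `toy_fitness` are assumptions.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one Lévy-distributed step using Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
             ) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0) or 1e-12  # guard against an exact zero
    return u / abs(v) ** (1 / beta)


def cuckoo_search(fitness, n_features, n_nests=10, max_gen=30, pa=0.25,
                  step_size=0.1, seed=0):
    """Minimize fitness(mask) over binary masks with a simplified CS loop."""
    rng = random.Random(seed)

    def to_mask(x):
        return [1 if v > 0.5 else 0 for v in x]  # assumed binarization

    def new_nest():
        return [rng.random() for _ in range(n_features)]

    nests = [new_nest() for _ in range(n_nests)]
    scores = [fitness(to_mask(x)) for x in nests]
    for _ in range(max_gen):
        # Get a cuckoo by a Lévy flight starting from a random nest.
        i = rng.randrange(n_nests)
        cuckoo = [min(1.0, max(0.0, v + step_size * levy_step(rng)))
                  for v in nests[i]]
        f_i = fitness(to_mask(cuckoo))
        # Compare with a random nest j and replace it if the cuckoo is better.
        j = rng.randrange(n_nests)
        if f_i < scores[j]:
            nests[j], scores[j] = cuckoo, f_i
        # Abandon a fraction pa of the worst nests and build new ones.
        worst_first = sorted(range(n_nests), key=lambda k: scores[k],
                             reverse=True)
        for k in worst_first[:int(pa * n_nests)]:
            nests[k] = new_nest()
            scores[k] = fitness(to_mask(nests[k]))
    best = min(range(n_nests), key=lambda k: scores[k])
    return to_mask(nests[best]), scores[best]


INFORMATIVE = {0, 3, 5}  # hypothetical set of useful feature indices


def toy_fitness(mask):
    """Made-up objective: penalize missed informative and extra features."""
    missed = sum(1 for j in INFORMATIVE if not mask[j])
    extra = sum(mask) - (len(INFORMATIVE) - missed)
    return missed + 0.1 * extra


mask, score = cuckoo_search(toy_fitness, n_features=8)
```

Because only the worst nests are abandoned, the best nest found so far always survives to the next generation.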
5. EXPERIMENTAL RESULTS AND DISCUSSIONS

We relied on four prominent evaluation metrics: accuracy, F1-score, recall, and precision. These
metrics are derived from the elementary counts contained in the confusion matrix, namely the true
positive, true negative, false positive and false negative values. The two evolutionary algorithms
above were implemented in Python with the openSMILE and librosa toolkits. The RAVDESS dataset
contains a total of 400 speech samples from male and female speakers expressing different emotions
in English. Applying feature extraction to these samples yields MFCCs, which are passed to the SVM
classifier to identify the emotion and estimate the accuracy of the model. To reduce the burden on
the classifier, the number of features is reduced using the meta-heuristic approaches EO and CS.
Applying the EO and CS algorithms individually, we achieved recognition accuracies of 89.64% and
92.71%, respectively.
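The way these metrics follow from a confusion matrix can be sketched as below. The function name `per_class_metrics` and the two-class toy matrix are illustrative assumptions and are not taken from the experiments reported here.

```python
def per_class_metrics(confusion):
    """Compute accuracy and per-class precision, recall and F1-score.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, wrong
        fn = sum(confusion[c]) - tp                       # true c, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return correct / total, metrics


# Made-up two-class example.
accuracy, metrics = per_class_metrics([[8, 2], [1, 9]])
```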

To evaluate the impact of the meta-heuristic approaches, we use precision and F1-score, two popular
metrics in the analysis of speech emotion classification. Table 1 and Table 2 report the precision
and F1-score of the EO-based and CS-based FS models for SER tasks, and the corresponding graphical
representations are shown in Figure 4 and Figure 5. These results indicate that
meta-heuristic-based FS models perform better than the traditional feature selection methods
discussed in section 2.

Finally, the EO-based and CS-based FS models achieve speech emotion recognition accuracies of
89.64% and 92.71%, respectively. Meta-heuristic optimization has likewise been used in many
classification problems such as speech emotion recognition, where it achieves impressive
recognition rates. Table 3 compares state-of-the-art methods with our meta-heuristic approach. To
determine the impact of feature selection methods, the success rate obtained without any selection
method is used as the reference value. From this analysis, we observe that the meta-heuristic
approach achieves better accuracy for speech emotion recognition than traditional feature
selection algorithms.


Table 1. Precision and F1-score using EO based FS model

Emotion     Precision   F1-score
Angry       0.84        0.77
Happy       0.88        0.91
Sad         0.79        0.85
Disgust     0.86        0.79
Surprise    0.81        0.74
Neutral     0.82        0.75


Table 2. Precision and F1-score using CS based FS model

Emotion     Precision   F1-score
Angry       0.84        0.82
Happy       0.67        0.91
Sad         0.93        0.89
Disgust     0.95        0.91
Surprise    0.91        0.89
Neutral     0.87        0.95


Table 3. Comparison of existing work and proposed work

No.  Author                          Features used      Classifier  Accuracy (%)
1    Ingale and Chaudhari [26]       MFCCs + LPCC       SVM         71
2    Shambhavi and Nitnaware [27]    MFCCs              SVM         84
3    El Ayadi et al. [28]            MFCCs + prosodic   SVM         79
4    Mustaqeem and Kwon [29]         MFCCs              CNN         79.50
5    Issa et al. [30]                MFCCs              CNN         86.1
6    Proposed (EO and CS)            MFCCs              SVM         89.64 and 92.71
Figure 4. Precision and F1-score using EO-FS model

Figure 5. Precision and F1-score using CS-FS model


6. CONCLUSION 

The main goal of this work was to achieve a high recognition rate with a smaller feature set. In
real-time use, the success rate decreases because of the high-dimensional feature set. To address
this problem, we built an FS model for SER using the meta-heuristic approaches EO and CS. In our
experiments, we used normalization and a reduced feature set to increase the precision and
F1-score. However, we faced some challenges, such as the number of iterations needed to reach the
best fitness. Another challenge is that feature engineering was performed manually rather than
automatically. Hence, there is room to apply these meta-heuristic approaches to automatic feature
engineering, for example with deep learning.